Manual

#Manual | 来源: 网络整理| 查看: 265

Introduction What is HISAT2?

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) against the general human population (as well as against a single reference genome). Based on GCSA (an extension of BWT for a graph), we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents general population, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover human population). These small indexes (called local indexes) combined with several alignment strategies enable effective alignment of sequencing reads. This new indexing scheme is called Hierarchical Graph FM index (HGFM). We have developed HISAT 2 based on the HISAT and Bowtie2 implementations. HISAT2 outputs alignments in SAM format, enabling interoperation with a large number of other tools (e.g. SAMtools, GATK) that use SAM. HISAT2 is distributed under the GPLv3 license, and it runs on the command line under Linux, Mac OS X and Windows.

Obtaining HISAT2

Download HISAT2 sources and binaries from the Releases sections on the right side. Binaries are available for Intel architectures (x86_64) running Linux, and Mac OS X.

Building from source

Building HISAT2 from source requires a GNU-like environment with GCC, GNU Make and other basics. It should be possible to build HISAT2 on most vanilla Linux installations or on a Mac installation with Xcode installed. HISAT2 can also be built on Windows using Cygwin or MinGW (MinGW recommended). For a MinGW build the choice of what compiler is to be used is important since this will determine if a 32 or 64 bit code can be successfully compiled using it. If there is a need to generate both 32 and 64 bit on the same machine then a multilib MinGW has to be properly installed. MSYS, the zlib library, and depending on architecture pthreads library are also required. We are recommending a 64 bit build since it has some clear advantages in real life research problems. In order to simplify the MinGW setup it might be worth investigating popular MinGW personal builds since these are coming already prepared with most of the toolchains needed.

First, download the source package from the Download section on the right side. Unzip the file, change to the unzipped directory, and build the HISAT2 tools by running GNU make (usually with the command make, but sometimes with gmake) with no arguments. If building with MinGW, run make from the MSYS environment.

HISAT2 is using the multithreading software model in order to speed up execution times on SMP architectures where this is possible. On POSIX platforms (like linux, Mac OS, etc) it needs the pthread library. Although it is possible to use pthread library on non-POSIX platform like Windows, due to performance reasons HISAT2 will try to use Windows native multithreading if possible.

For the support of SRA data access in HISAT2, please download and install the NCBI-NGS toolkit. When running make, specify additional variables as follow. make USE_SRA=1 NCBI_NGS_DIR=/path/to/NCBI-NGS-directory NCBI_VDB_DIR=/path/to/NCBI-NGS-directory, where NCBI_NGS_DIR and NCBI_VDB_DIR will be used in Makefile for -I and -L compilation options. For example, $(NCBI_NGS_DIR)/include and $(NCBI_NGS_DIR)/lib64 will be used.

Running HISAT2 Adding to PATH

By adding your new HISAT2 directory to your PATH environment variable, you ensure that whenever you run hisat2, hisat2-build or hisat2-inspect from the command line, you will get the version you just installed without having to specify the entire path. This is recommended for most users. To do this, follow your operating system’s instructions for adding the directory to your PATH.

If you would like to install HISAT2 by copying the HISAT2 executable files to an existing directory in your PATH, make sure that you copy all the executables, including hisat2, hisat2-align-s, hisat2-align-l, hisat2-build, hisat2-build-s, hisat2-build-l, hisat2-inspect, hisat2-inspect-s and hisat2-inspect-l.

Reporting

The reporting mode governs how many alignments HISAT2 looks for, and how to report them.

In general, when we say that a read has an alignment, we mean that it has a valid alignment. When we say that a read has multiple alignments, we mean that it has multiple alignments that are valid and distinct from one another.

By default, HISAT2 may soft-clip reads near their 5’ and 3’ ends. Users can control this behavior by setting different penalties for soft-clipping (--sp) or by disallowing soft-clipping ([--no-softclip]).

Distinct alignments map a read to different places

Two alignments for the same individual read are “distinct” if they map the same read to different places. Specifically, we say that two alignments are distinct if there are no alignment positions where a particular read offset is aligned opposite a particular reference offset in both alignments with the same orientation. E.g. if the first alignment is in the forward orientation and aligns the read character at read offset 10 to the reference character at chromosome 3, offset 3,445,245, and the second alignment is also in the forward orientation and also aligns the read character at read offset 10 to the reference character at chromosome 3, offset 3,445,245, they are not distinct alignments.

Two alignments for the same pair are distinct if either the mate 1s in the two paired-end alignments are distinct or the mate 2s in the two alignments are distinct or both.

Default mode: search for one or more alignments, report each

HISAT2 searches for up to N distinct, primary alignments for each read, where N equals the integer specified with the -k parameter. Primary alignments mean alignments whose alignment score is equal or higher than any other alignments. It is possible that multiple distinct alignments have the same score. That is, if -k 2 is specified, HISAT2 will search for at most 2 distinct alignments. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM ‘secondary’ bit (which equals 256) set in its FLAGS field. See the SAM specification for details.

HISAT2 does not “find” alignments in any specific order, so for reads that have more than N distinct, valid alignments, HISAT2 does not guarantee that the N alignments reported are the best possible in terms of alignment score. Still, this mode can be effective and fast in situations where the user cares more about whether a read aligns (or aligns a certain number of times) than where exactly it originated.

Alignment summary

When HISAT2 finishes running, it prints messages summarizing what happened. These messages are printed to the “standard error” (“stderr”) filehandle. For datasets consisting of unpaired reads, the summary might look like this:

20000 reads; of these: 20000 (100.00%) were unpaired; of these: 1247 (6.24%) aligned 0 times 18739 (93.69%) aligned exactly 1 time 14 (0.07%) aligned >1 times 93.77% overall alignment rate

For datasets consisting of pairs, the summary might look like this:

10000 reads; of these: 10000 (100.00%) were paired; of these: 650 (6.50%) aligned concordantly 0 times 8823 (88.23%) aligned concordantly exactly 1 time 527 (5.27%) aligned concordantly >1 times ---- 650 pairs aligned concordantly 0 times; of these: 34 (5.23%) aligned discordantly 1 time ---- 616 pairs aligned 0 times concordantly or discordantly; of these: 1232 mates make up the pairs; of these: 660 (53.57%) aligned 0 times 571 (46.35%) aligned exactly 1 time 1 (0.08%) aligned >1 times 96.70% overall alignment rate

The indentation indicates how subtotals relate to totals.

Wrapper

The hisat2, hisat2-build and hisat2-inspect executables are actually wrapper scripts that call binary programs as appropriate. The wrappers shield users from having to distinguish between “small” and “large” index formats, discussed briefly in the following section. Also, the hisat2 wrapper provides some key functionality, like the ability to handle compressed inputs, and the functionality for --un, --al and related options.

It is recommended that you always run the hisat2 wrappers and not run the binaries directly.

Small and large indexes

hisat2-build can index reference genomes of any size. For genomes less than about 4 billion nucleotides in length, hisat2-build builds a “small” index using 32-bit numbers in various parts of the index. When the genome is longer, hisat2-build builds a “large” index using 64-bit numbers. Small indexes are stored in files with the .ht2 extension, and large indexes are stored in files with the .ht2l extension. The user need not worry about whether a particular index is small or large; the wrapper scripts will automatically build and use the appropriate index.

Performance tuning If your computer has multiple processors/cores, use -p The -p option causes HISAT2 to launch a specified number of parallel search threads. Each thread runs on a different processor/core and all threads find alignments in parallel, increasing alignment throughput by approximately a multiple of the number of threads (though in practice, speedup is somewhat worse than linear). Command Line Setting function options

Some HISAT2 options specify a function rather than an individual number or setting. In these cases the user specifies three parameters: (a) a function type F, (b) a constant term B, and (c) a coefficient A. The available function types are constant (C), linear (L), square-root (S), and natural log (G). The parameters are specified as F,B,A - that is, the function type, the constant term, and the coefficient are separated by commas with no whitespace. The constant term and coefficient may be negative and/or floating-point numbers.

For example, if the function specification is L,-0.4,-0.6, then the function defined is:

f(x) = -0.4 + -0.6 * x

If the function specification is G,1,5.4, then the function defined is:

f(x) = 1.0 + 5.4 * ln(x)

See the documentation for the option in question to learn what the parameter x is for. For example, in the case if the --score-min option, the function f(x) sets the minimum alignment score necessary for an alignment to be considered valid, and x is the read length.

Usage hisat2 [options]* -x {-1 -2 | -U | --sra-acc } [-S ] Main arguments -x The basename of the index for the reference genome. The basename is the name of any of the index files up to but not including the final .1.ht2 / etc. hisat2 looks for the specified index first in the current directory, then in the directory specified in the HISAT2_INDEXES environment variable. -1 Comma-separated list of files containing mate 1s (filename usually includes _1), e.g. -1 flyA_1.fq,flyB_1.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in . Reads may be a mix of different lengths. If - is specified, hisat2 will read the mate 1s from the “standard in” or “stdin” filehandle. -2 Comma-separated list of files containing mate 2s (filename usually includes _2), e.g. -2 flyA_2.fq,flyB_2.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in . Reads may be a mix of different lengths. If - is specified, hisat2 will read the mate 2s from the “standard in” or “stdin” filehandle. -U Comma-separated list of files containing unpaired reads to be aligned, e.g. lane1.fq,lane2.fq,lane3.fq,lane4.fq. Reads may be a mix of different lengths. If - is specified, hisat2 gets the reads from the “standard in” or “stdin” filehandle. --sra-acc Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654. Information about read types is available at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?sp=runinfo&acc=sra-acc&retmode=xml, where sra-acc is SRA accession number. If users run HISAT2 on a computer cluster, it is recommended to disable SRA-related caching (see the instruction at SRA-MANUAL). -S File to write SAM alignments to. By default, alignments are written to the “standard out” or “stdout” filehandle (i.e. the console). Options Input options -q Reads (specified with , , ) are FASTQ files. FASTQ files usually have extension .fq or .fastq. FASTQ is the default format. See also: --solexa-quals and --int-quals. --qseq Reads (specified with , , ) are QSEQ files. QSEQ files usually end in _qseq.txt. See also: --solexa-quals and --int-quals. -f Reads (specified with , , ) are FASTA files. FASTA files usually have extension .fa, .fasta, .mfa, .fna or similar. FASTA files do not have a way of specifying quality values, so when -f is set, the result is as if --ignore-quals is also set. -r Reads (specified with , , ) are files with one input sequence per line, without any other information (no read names, no qualities). When -r is set, the result is as if --ignore-quals is also set. -c The read sequences are given on command line. I.e. , and are comma-separated lists of reads rather than lists of read files. There is no way to specify read names or qualities, so -c also implies --ignore-quals. -s/--skip Skip (i.e. do not align) the first reads or pairs in the input. -u/--upto Align the first reads or read pairs from the input (after the -s/--skip reads or pairs have been skipped), then stop. Default: no limit. -5/--trim5 Trim bases from 5’ (left) end of each read before alignment (default: 0). -3/--trim3 Trim bases from 3’ (right) end of each read before alignment (default: 0). --phred33 Input qualities are ASCII chars equal to the Phred quality plus 33. This is also called the “Phred+33” encoding, which is used by the very latest Illumina pipelines. --phred64 Input qualities are ASCII chars equal to the Phred quality plus 64. This is also called the “Phred+64” encoding. --solexa-quals Convert input qualities from Solexa (which can be negative) to Phred (which can’t). This scheme was used in older Illumina GA Pipeline versions (prior to 1.3). Default: off. --int-quals Quality values are represented in the read input file as space-separated ASCII integers, e.g., 40 40 30 40…, rather than ASCII characters, e.g., II?I…. Integers are treated as being on the Phred quality scale unless --solexa-quals is also specified. Default: off. Alignment options --n-ceil Sets a function governing the maximum number of ambiguous characters (usually Ns and/or .s) allowed in a read as a function of read length. For instance, specifying -L,0,0.15 sets the N-ceiling function f to f(x) = 0 + 0.15 * x, where x is the read length. See also: [setting function options]. Reads exceeding this ceiling are filtered out. Default: L,0,0.15. --ignore-quals When calculating a mismatch penalty, always consider the quality value at the mismatched position to be the highest possible, regardless of the actual value. I.e. input is treated as though all quality values are high. This is also the default behavior when the input doesn’t specify quality values (e.g. in -f, -r, or -c modes). --nofw/--norc If --nofw is specified, hisat2 will not attempt to align unpaired reads to the forward (Watson) reference strand. If --norc is specified, hisat2 will not attempt to align unpaired reads against the reverse-complement (Crick) reference strand. In paired-end mode, --nofw and --norc pertain to the fragments; i.e. specifying --nofw causes hisat2 to explore only those paired-end configurations corresponding to fragments from the reverse-complement (Crick) strand. Default: both strands enabled. Scoring options --mp MX,MN Sets the maximum (MX) and minimum (MN) mismatch penalties, both integers. A number less than or equal to MX and greater than or equal to MN is subtracted from the alignment score for each position where a read character aligns to a reference character, the characters do not match, and neither is an N. If --ignore-quals is specified, the number subtracted quals MX. Otherwise, the number subtracted is MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) ) where Q is the Phred quality value. Default: MX = 6, MN = 2. --sp MX,MN Sets the maximum (MX) and minimum (MN) penalties for soft-clipping per base, both integers. A number less than or equal to MX and greater than or equal to MN is subtracted from the alignment score for each position. The number subtracted is MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) ) where Q is the Phred quality value. Default: MX = 2, MN = 1. --no-softclip Disallow soft-clipping. --np Sets penalty for positions where the read, reference, or both, contain an ambiguous character such as N. Default: 1. --rdg , Sets the read gap open () and extend () penalties. A read gap of length N gets a penalty of + N * . Default: 5, 3. --rfg , Sets the reference gap open () and extend () penalties. A reference gap of length N gets a penalty of + N * . Default: 5, 3. --score-min Sets a function governing the minimum alignment score needed for an alignment to be considered “valid” (i.e. good enough to report). This is a function of read length. For instance, specifying L,0,-0.6 sets the minimum-score function f to f(x) = 0 + -0.6 * x, where x is the read length. See also: [setting function options]. The default is L,0,-0.2. Spliced alignment options --pen-cansplice Sets the penalty for each pair of canonical splice sites (e.g. GT/AG). Default: 0. --pen-noncansplice Sets the penalty for each pair of non-canonical splice sites (e.g. non-GT/AG). Default: 12. --pen-canintronlen Sets the penalty for long introns with canonical splice sites so that alignments with shorter introns are preferred to those with longer ones. Default: G,-8,1 --pen-noncanintronlen Sets the penalty for long introns with noncanonical splice sites so that alignments with shorter introns are preferred to those with longer ones. Default: G,-8,1 --min-intronlen Sets minimum intron length. Default: 20 --max-intronlen Sets maximum intron length. Default: 500000 --known-splicesite-infile With this mode, you can provide a list of known splice sites, which HISAT2 makes use of to align reads with small anchors. You can create such a list using python hisat2_extract_splice_sites.py genes.gtf > splicesites.txt, where hisat2_extract_splice_sites.py is included in the HISAT2 package, genes.gtf is a gene annotation file, and splicesites.txt is a list of splice sites with which you provide HISAT2 in this mode. Note that it is better to use indexes built using annotated transcripts (such as genome_tran or genome_snp_tran), which works better than using this option. It has no effect to provide splice sites that are already included in the indexes. --novel-splicesite-outfile In this mode, HISAT2 reports a list of splice sites in the file :

chromosome name genomic position of the flanking base on the left side of an intron genomic position of the flanking base on the right strand (+, -, and .)

’.’ indicates an unknown strand for non-canonical splice sites.

--novel-splicesite-infile With this mode, you can provide a list of novel splice sites that were generated from the above option “–novel-splicesite-outfile”. --no-temp-splicesite HISAT2, by default, makes use of splice sites found by earlier reads to align later reads in the same run, in particular, reads with small anchors ( eg2.bam

Use samtools sort to convert the BAM file to a sorted BAM file. The following command requires samtools version 1.2 or higher.

samtools sort eg2.bam -o eg2.sorted.bam

We now have a sorted BAM file called eg2.sorted.bam. Sorted BAM is a useful format because the alignments are (a) compressed, which is convenient for long-term storage, and (b) sorted, which is convenient for variant discovery. To generate variant calls in VCF format, run:

samtools mpileup -uf $HISAT2_HOME/example/reference/22_20-21M.fa eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf

Then to view the variants, run:

bcftools view eg2.raw.bcf

See the official SAMtools guide to Calling SNPs/INDELs with SAMtools/BCFtools for more details and variations on this process.

【本文地址】

公司简介

联系我们

今日新闻

推荐新闻

专题文章